Yang LI Junyong YE Tongqing WANG Shijian HUANG
Traditional sparse representation-based methods for human action recognition usually pool over the entire video to form the final feature representation, neglecting the spatio-temporal information of features. To exploit spatio-temporal information, we present a novel histogram representation obtained by computing statistics on the frame-by-frame temporal changes of sparse coding coefficients within spatial pyramids constructed from the videos. The histograms are further fed into a support vector machine with a spatial pyramid matching kernel for final action classification. We validate our method on two benchmarks, KTH and UCF Sports, and the experimental results show the effectiveness of our method in human action recognition.
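As a rough illustration of this kind of histogram, the sketch below (in Python/NumPy, with hypothetical inputs) bins the frame-to-frame changes of sparse coding coefficients for a single spatial-pyramid cell; the actual statistics, pyramid construction and normalization used in the paper may differ.

```python
import numpy as np

def temporal_change_histogram(frame_codes, n_bins=8, max_change=1.0):
    """Histogram of frame-to-frame changes of sparse coding coefficients.

    frame_codes: (T, K) array, one K-dimensional sparse coefficient vector per
    frame, assumed to be already pooled within one spatial-pyramid cell.
    Returns a (K * n_bins,) histogram describing how strongly each codeword's
    coefficient changes between consecutive frames.
    """
    diffs = np.abs(np.diff(frame_codes, axis=0))            # (T-1, K) temporal changes
    edges = np.linspace(0.0, max_change, n_bins + 1)
    hists = [np.histogram(diffs[:, k], bins=edges)[0] for k in range(diffs.shape[1])]
    h = np.concatenate(hists).astype(float)
    return h / (h.sum() + 1e-8)                             # L1-normalize the cell histogram

# The final video descriptor would concatenate such histograms over all
# spatial-pyramid cells before the SVM with a spatial pyramid matching kernel.
```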
Zhong ZHANG Shuang LIU Xing MEI
The bag-of-words model (BOW) has been extensively adopted by recent human action recognition methods. The pooling operation, which aggregates local descriptor encodings into a single representation, is a key determinant of the performance of BOW-based methods. However, the spatio-temporal relationship among interest points has rarely been considered in the pooling step, which results in an imprecise representation of human actions. In this paper, we propose a novel pooling strategy named contextual max pooling (CMP) to overcome this limitation. We add a constraint term to the objective function under the framework of max pooling, which forces the weights of interest points to be consistent with their probabilities. In this way, CMP explicitly considers the spatio-temporal contextual relationships among interest points and inherits the positive properties of max pooling. Our method is verified on three challenging datasets (KTH, UCF Sports and UCF Films), and the experimental results demonstrate that it outperforms state-of-the-art methods in human action recognition.
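The constraint term and full objective are not reproduced here; the following is only a schematic contrast (in Python/NumPy) between plain max pooling and a probability-weighted variant in the spirit of CMP, where context_prob is a hypothetical per-interest-point weight derived from its spatio-temporal neighbours.

```python
import numpy as np

def max_pooling(codes):
    """Plain max pooling: codes is (N, K), one encoding per interest point."""
    return codes.max(axis=0)

def contextual_max_pooling(codes, context_prob):
    """Schematic stand-in for CMP: scale each interest point's encoding by a
    contextual probability before taking the maximum, so that interest points
    that are unlikely given their spatio-temporal context contribute less.
    context_prob: (N,) weights in [0, 1]; how they are learned is not shown here.
    """
    weighted = codes * context_prob[:, None]
    return weighted.max(axis=0)
```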
Recently, locality-constrained linear coding (LLC) has attracted much attention as a coding strategy, owing to its better reconstruction than sparse coding and vector quantization (VQ). However, LLC ignores the weight information of codewords during the coding stage and assumes that every selected basis has the same credibility, even when their weights differ. To further improve the discriminative power of the LLC code, we propose a weighted LLC (WLLC) algorithm that takes the codeword weight information into account. Experiments on the KTH and UCF datasets show that the recognition system based on WLLC achieves better performance than those based on classical LLC and VQ, and outperforms recent classical systems.
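For reference, a minimal sketch of the usual k-nearest-codeword approximation of LLC is given below, with an optional per-codeword weight standing in for the WLLC idea; the exact weighting scheme of the paper is not reproduced.

```python
import numpy as np

def llc_encode(x, codebook, k=5, lam=1e-4, codeword_weight=None):
    """Approximate LLC coding of one descriptor x (d,) with an (M, d) codebook.

    codeword_weight (M,) is a hypothetical per-codeword credibility used to
    re-weight the resulting code, illustrating the WLLC idea described above.
    """
    dist = np.linalg.norm(codebook - x, axis=1)
    idx = np.argsort(dist)[:k]                      # k nearest codewords
    z = codebook[idx] - x                           # shift codewords to the descriptor
    C = z @ z.T                                     # local covariance (k, k)
    C += lam * np.trace(C) * np.eye(k)              # regularize for numerical stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                                    # enforce the sum-to-one constraint
    code = np.zeros(codebook.shape[0])
    code[idx] = w
    if codeword_weight is not None:                 # assumed WLLC-style re-weighting
        code *= codeword_weight
    return code
```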
Changhong CHEN Shunqing YANG Zongliang GAN
Cross-view action recognition is a challenging research field in human motion analysis, because appearance-based features are not reliable when the viewpoint changes. In this paper, a new framework is proposed for cross-view action recognition via topic-based knowledge transfer. First, spatio-temporal descriptors are extracted from the action videos and each video is modeled by a bag of visual words (BoVW) based on a codebook constructed with the k-means clustering algorithm. Second, Latent Dirichlet Allocation (LDA) is employed to assign topics to the BoVW representation. The topic distribution of visual words (ToVW) is normalized and taken as the feature vector. Third, in order to bridge different views, we transform ToVW into bilingual ToVW by constructing bilingual dictionaries, which guarantee that the same action has the same representation across views. We demonstrate the effectiveness of the proposed algorithm on the IXMAS multi-view dataset.
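A minimal sketch of the second step (assigning topics to BoVW histograms with LDA) is shown below using scikit-learn; the codebook construction, the bilingual dictionaries and the exact topic settings of the paper are not reproduced, and the input histograms here are synthetic.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical BoVW histograms: one row of visual-word counts per action video.
bovw = np.random.randint(0, 20, size=(100, 500))

lda = LatentDirichletAllocation(n_components=30, random_state=0)
tovw = lda.fit_transform(bovw)             # per-video topic distribution (ToVW)
tovw /= tovw.sum(axis=1, keepdims=True)    # normalized ToVW feature vectors
```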
Wen ZHOU Chunheng WANG Baihua XIAO Zhong ZHANG Yunxue SHAO
Recognizing human actions in complex scenes is a challenging problem in computer vision. Action-unrelated concepts, such as the camera position, can significantly affect the appearance of local spatio-temporal features, so the performance of methods based on low-level features degrades. In this letter, we treat one such action-unrelated concept, the camera position, as a high-level feature. We observe that it can serve as a prior for local spatio-temporal features in human action recognition. We encode this prior by modeling interactions between spatio-temporal features and camera position features, and infer camera position features from local spatio-temporal features via these interactions. The parameters of this model are estimated by a new max-margin algorithm. We evaluate the proposed method on the KTH, IXMAS and YouTube actions datasets. Experimental results show the effectiveness of the proposed method.
Hongbo ZHANG Shaozi LI Songzhi SU Shu-Yuan CHEN
Many successful methods for recognizing human action are spatio-temporal interest point (STIP) based methods. Given a test video sequence, for a matching-based method using a voting mechanism, each test STIP casts a vote for each action class based on its mutual information with respect to the respective class, which is measured in terms of class likelihood probability. Therefore, two issues should be addressed to improve the accuracy of action recognition. First, effective STIPs in the training set must be selected as references for accurately estimating probability. Second, discriminative STIPs in the test set must be selected for voting. This work uses ε-nearest neighbors as effective STIPs for estimating the class probability and uses a variance filter for selecting discriminative STIPs. Experimental results verify that the proposed method is more accurate than existing action recognition methods.
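The sketch below (Python/NumPy, with hypothetical inputs) illustrates the voting idea: each test STIP estimates class likelihoods from training STIPs within distance ε and casts a soft vote, and a simple variance-based filter keeps only the more discriminative test STIPs. The paper's mutual-information weighting and the exact form of its variance filter are not reproduced.

```python
import numpy as np

def vote_for_classes(test_stips, train_stips, train_labels, n_classes, eps=0.5):
    """Schematic STIP voting with epsilon-nearest neighbours and a variance filter."""
    per_point = []
    for x in test_stips:
        d = np.linalg.norm(train_stips - x, axis=1)
        nbr = train_labels[d < eps]                 # epsilon-nearest training STIPs
        if nbr.size == 0:
            nbr = train_labels[[np.argmin(d)]]      # fall back to the single nearest STIP
        p = np.bincount(nbr, minlength=n_classes) / nbr.size   # class likelihoods
        per_point.append(p)
    per_point = np.array(per_point)
    # Keep only test STIPs whose class distribution is peaked (high variance
    # across classes), i.e. the discriminative ones, then accumulate their votes.
    keep = per_point.var(axis=1) >= per_point.var(axis=1).mean()
    votes = per_point[keep].sum(axis=0)
    return int(np.argmax(votes))
```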
Temporal Self-Similarity Matrix (SSM) based action recognition is one of the important approaches to single-person-oriented action analysis in computer vision. In this study, we propose a new kind of SSM and a fast computation method. The computation method does not require time-consuming pre-processing to find bounding boxes of the human body; instead, it processes difference images to obtain action patterns, which can be done very quickly. The proposed SSM is experimentally confirmed to achieve better classification performance than four typical kinds of SSMs.
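A minimal sketch of computing an SSM from difference images (rather than from tracked bounding boxes) is given below; the per-frame descriptor and distance used in the paper are not reproduced, so this is only illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def ssm_from_difference_images(frames, thresh=25):
    """Build a temporal Self-Similarity Matrix directly from difference images.

    frames: (T, H, W) grayscale video. Each frame is described by its thresholded
    difference image; SSM[i, j] is the Euclidean distance between descriptors.
    """
    frames = frames.astype(float)
    diffs = np.abs(np.diff(frames, axis=0)) > thresh       # (T-1, H, W) motion masks
    feats = diffs.reshape(diffs.shape[0], -1).astype(float)
    return cdist(feats, feats)                             # (T-1, T-1) self-similarity matrix
```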
Guoliang LU Mineichi KUDO Jun TOYAMA
Vision-based human action recognition has been an active research field in recent years. Exemplar matching is an important and popular methodology in this field; however, most previous works perform exemplar matching on the whole input video clip for recognition. Such a strategy is computationally expensive and limits practical usage. In this paper, we present a martingale framework for selecting characteristic frames from an input video clip without requiring any prior knowledge. Action recognition is then performed on these selected characteristic frames. Experiments on 10 actions from the WEIZMANN dataset demonstrate a significant improvement in computational efficiency (a 54% reduction) while achieving the same recognition precision.
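The exact strangeness measure of the paper is not given here; the sketch below uses the standard randomized power martingale for change detection on per-frame descriptors, selecting a characteristic frame whenever the martingale exceeds a threshold. All parameter values are illustrative assumptions.

```python
import numpy as np

def select_characteristic_frames(frame_feats, eps=0.92, threshold=10.0, seed=0):
    """Sketch of martingale-based characteristic frame selection.

    frame_feats: (T, d) per-frame descriptors. A frame is selected when the
    randomized power martingale exceeds `threshold`, signalling that recent
    frames differ markedly from those seen so far.
    """
    rng = np.random.default_rng(seed)
    selected, history, strangeness, M = [], [], [], 1.0
    for t, f in enumerate(frame_feats):
        history.append(f)
        s = np.linalg.norm(f - np.mean(history, axis=0))    # strangeness of frame t
        strangeness.append(s)
        s_arr = np.array(strangeness)
        theta = rng.uniform()
        p = (np.sum(s_arr > s) + theta * np.sum(s_arr == s)) / len(s_arr)  # randomized p-value
        M *= eps * p ** (eps - 1.0)                          # power martingale update
        if M > threshold:
            selected.append(t)                               # characteristic frame found
            history, strangeness, M = [f], [], 1.0           # restart detection
    return selected
```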
Il-Woong JEONG Jin CHOI Kyusung CHO Yong-Ho SEO Hyun Seung YANG
Detecting emergency situations is very important for a surveillance system for people such as the elderly living alone. A vision-based emergency response system with a paramedic mobile robot is presented in this paper. The proposed system consists of a vision-based emergency detection system and a mobile robot acting as a paramedic. The vision-based emergency detection system detects emergencies by tracking people and recognizing their actions in image sequences acquired by a single surveillance camera. In order to recognize human actions, interest regions are segmented from the background using a blob extraction method and tracked continuously using a generic model. Then an MHI (Motion History Image) for each tracked person is constructed from the silhouette information of the region blobs and used to model actions. An emergency situation is finally detected by feeding this information to a neural network. When an emergency is detected, the mobile robot can help diagnose the status of the person involved. To send the mobile robot to the proper position, we implement a navigation algorithm based on the distance between the person and the robot. We validate our system by reporting the emergency detection rate and by demonstrating emergency response with the mobile robot.
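As one concrete piece of the pipeline, the sketch below shows a standard Motion History Image update from a per-frame silhouette mask (Python/NumPy); the blob extraction, tracking and neural-network classifier described above are not included, and the decay parameters are illustrative.

```python
import numpy as np

def update_mhi(mhi, silhouette, tau=255, delta=1):
    """One step of Motion History Image construction for a tracked person.

    mhi:        (H, W) current motion history image.
    silhouette: (H, W) boolean mask of the tracked blob in the current frame.
    Pixels covered by the silhouette are set to tau; elsewhere the history
    decays by delta, so recent motion stays bright and old motion fades.
    """
    mhi = np.maximum(mhi.astype(float) - delta, 0)
    mhi[silhouette] = tau
    return mhi

# A downsampled or flattened MHI would then be fed to the neural network
# described above to decide whether an emergency situation has occurred.
```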
The frequency response of the log-Gabor function closely matches the frequency response of primate visual neurons. In this letter, motion-salient regions are extracted based on the 2D log-Gabor wavelet transform of the spatio-temporal form of actions. A supervised classification technique is then used to classify the actions. The proposed method is robust to irregular segmentation of actors. Moreover, the 2D log-Gabor wavelet permits a more compact representation of actions than recent neurobiological models using Gabor wavelets.
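For illustration, a generic frequency-domain 2D log-Gabor filter can be constructed as sketched below (Python/NumPy); the center frequency, bandwidth and orientation parameters are assumptions, not the paper's settings.

```python
import numpy as np

def log_gabor_filter(size, f0=0.1, sigma_ratio=0.65, theta0=0.0, sigma_theta=np.pi / 8):
    """Generic 2D log-Gabor filter defined in the frequency domain.

    size: (H, W) of the frame. f0 is the center frequency, sigma_ratio controls
    the radial bandwidth, and theta0/sigma_theta set the orientation tuning.
    """
    h, w = size
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)
    r[0, 0] = 1.0                                   # avoid log(0) at the DC component
    theta = np.arctan2(fy, fx)
    radial = np.exp(-(np.log(r / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
    radial[0, 0] = 0.0                              # log-Gabor has no DC response
    dtheta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
    angular = np.exp(-(dtheta ** 2) / (2 * sigma_theta ** 2))
    return radial * angular

# Filtering a frame: np.real(np.fft.ifft2(np.fft.fft2(frame) * log_gabor_filter(frame.shape)))
```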
This letter proposes a neurobiological approach to action recognition. In this approach, actions are represented by a visual-neuron feature (VNF) based on a quantitative model of object representation in the primate visual cortex. A supervised classification technique is then used to classify the actions. The proposed VNF is invariant to affine translation and scaling of moving objects while maintaining action specificity. Moreover, it is robust to the deformation of actors. Experiments on publicly available action datasets demonstrate that the proposed approach outperforms conventional action recognition models based on computer-vision features.
This paper addresses the problem of view-invariant action recognition using 2D trajectories of landmark points on the human body. It is a challenging task since, for a specific action category, the 2D observations of different instances might differ greatly due to varying viewpoints and changes in speed. By assuming that the execution of an action can be approximated by a dynamic linear combination of a set of basis shapes, a novel view-invariant human action recognition method is proposed based on non-rigid matrix factorization and Hidden Markov Models (HMMs). We show that the low-dimensional weight coefficients of the basis shapes, obtained by non-rigid factorization of the measurement matrix, contain the key information for action recognition regardless of viewpoint changes. Based on the extracted discriminative features, HMMs are used for temporal dynamic modeling and robust action classification. The proposed method is tested on real-life sequences and achieves promising performance.
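As a rough sketch of the feature-extraction idea, the code below projects the centered trajectory measurement matrix onto its leading singular directions to obtain low-dimensional per-frame coefficients, and notes how class-specific HMMs could then be trained; the full non-rigid factorization with rotation constraints used in the paper is not reproduced, and hmmlearn is an assumed, not author-specified, library choice.

```python
import numpy as np
from hmmlearn import hmm   # assumed choice of HMM library

def frame_features_from_trajectories(W, n_basis=3):
    """W: (2F, P) measurement matrix of 2D landmark trajectories (x/y rows per frame).

    A coarse proxy for the basis-shape weight coefficients: center W and project
    each frame's rows onto the leading 3 * n_basis right singular directions.
    Returns one feature vector per frame.
    """
    W = W - W.mean(axis=1, keepdims=True)            # remove per-row translation
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    coeffs = U[:, :3 * n_basis] * s[:3 * n_basis]    # (2F, 3K) low-dimensional coefficients
    n_frames = W.shape[0] // 2
    return coeffs.reshape(n_frames, -1)              # (F, 6K) per-frame features

# One Gaussian HMM per action class, trained on per-frame features; a test
# sequence is assigned to the class whose HMM gives the highest log-likelihood.
# model = hmm.GaussianHMM(n_components=5).fit(frame_features_from_trajectories(W_train))
```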